Working directory is now: /Users/maggieshi/Documents/GitHub/winter2026/lectures/spatial
February 9, 2026
sjoin to compare/combine two GeoDataFramespoints_from_xy to convert regular dataset with xy coordinates into spatialspatial/ folderpreprocessing.pyspatial/ folderspatial/ folderdata/raw-data/: where raw datasets are storeddata/derived-data/: where derived datasets are storedpreprocessing.py: code to process raw into derived dataREADME.md: documents original data sources for raw data and explains how they are processedspatial/raw-data/ folderspatial/raw-datapreprocessing.py: set pathWorking directory is now: /Users/maggieshi/Documents/GitHub/winter2026/lectures/spatial
preprocessing.pyExample of code chunk to create lower_48_states.geojson
output_path = "data/derived-data/lower_48_states.geojson"
if not os.path.exists(output_path):
# code to process lower_48_states.geojson would be here
print(f"GeoDataFrame successfully exported to: {output_path}")
else:
print(f"Skipping: {output_path} already exists.")Skipping: data/derived-data/lower_48_states.geojson already exists.
derived-data/README.mdpreprocessing.py, and build checks to reduce redundancyREADME.mdgeopandas package can read in .geojson’s, in the same way that it reads in .shp and .gpkggeojson: JSON data type, but with spatial features like geometry and coordinatesiso_gdf is a GeoDataFrame: a dataframe with spatial features + methods
Coordinate reference system: EPSG:4326
Source: epsg.io/4326
.geojson, the CRS is embedded directly into the file OBJECTID ID NAME \
0 1 56669 Midcontinent Independent Transmission System O...
1 2 59504 Southwest Power Pool
2 3 14725 Pjm Interconnection, Llc
3 4 5723 Electric Reliability Council Of Texas, Inc.
4 5 2775 California Independent System Operator
ADDRESS CITY STATE ZIP TELEPHONE \
0 720 WEST CITY CENTER DRIVE CARMEL IN 46032 (317) 249-5400
1 201 WORTHEN DRIVE LITTLE ROCK AR 72223 (501) 614-3200
2 2750 MONROE BOULEVARD AUDUBON PA 19403 (610) 666-8980
3 7620 METRO CENTER DRIVE AUSTIN TX 78744 (512) 225-7000
4 250 OUTCROPPING WAY FOLSOM CA 95630 (916) 351-4400
COUNTRY NAICS_CODE ... SOURCE \
0 USA 2211 ... FERC 714, EIA 860, EIA 861, TIGER/Line Shapefi...
1 USA 2211 ... FERC 714, EIA 860, EIA 861, TIGER/Line Shapefi...
2 USA 2211 ... FERC 714, EIA 860, EIA 861, TIGER/Line Shapefi...
3 USA 2211 ... FERC 714, EIA 860, EIA 861, TIGER/Line Shapefi...
4 USA 2211 ... FERC 714, EIA 860, EIA 861, TIGER/Line Shapefi...
SOURCEDATE VAL_METHOD VAL_DATE \
0 Wed, 24 Jun 2020 00:00:00 GMT OTHER Fri, 28 May 2021 00:00:00 GMT
1 Wed, 24 Jun 2020 00:00:00 GMT OTHER Fri, 28 May 2021 00:00:00 GMT
2 Wed, 24 Jun 2020 00:00:00 GMT OTHER Fri, 28 May 2021 00:00:00 GMT
3 Wed, 24 Jun 2020 00:00:00 GMT OTHER Fri, 28 May 2021 00:00:00 GMT
4 Wed, 24 Jun 2020 00:00:00 GMT OTHER Fri, 28 May 2021 00:00:00 GMT
WEBSITE PEAK_LOAD AVG_LOAD YEAR \
0 https://www.misoenergy.org/ 116627 97191 2019
1 https://www.spp.org/ 50510 41783 2019
2 http://www.pjm.com/ 151499 127285 2019
3 http://www.ercot.com/ 74819 61881 2019
4 http://www.caiso.com/ 44145 34427 2019
GlobalID \
0 29835694-4676-44ed-84cd-b74a88642762
1 d2048dd2-fd6f-4d09-861a-29582d9f9b23
2 cd4cfe9b-621d-4f62-98fb-71a9efcfbdee
3 699ab577-b395-40c5-a742-e0f9dd88e546
4 8b2c5193-a60e-4249-96fb-80877fd919fa
geometry
0 MULTIPOLYGON (((-90.86519 29.04521, -90.86601 ...
1 MULTIPOLYGON (((-104.05317 45.46544, -104.2188...
2 MULTIPOLYGON (((-76.74164 34.93144, -76.74147 ...
3 MULTIPOLYGON (((-97.22168 25.98203, -97.22194 ...
4 MULTIPOLYGON (((-124.16609 40.80788, -124.1661...
[5 rows x 21 columns]
Discussion question: what observations can you make after visualizing the geodata that you couldn’t before?
geopandas
gpd.read_file()plot() for basic visualization.geojson data type is a similar alternative to .shp and .gpkggeopandas how to project dataGeoDataFrame have a variety of methods that can…
centroid, unionlength, areasjoinpydecksjoinEnd goal:
First, let’s load and inspect the state map we want to join with MISO: lower_48_states.geojson
us_states_gdf = gpd.read_file('data/derived-data/lower_48_states.geojson')
us_states_gdf.crs = iso_gdf.crs
print(us_states_gdf.head()) id name density geometry
0 01 Alabama 94.65 POLYGON ((-87.3593 35.00118, -85.60668 34.9847...
1 04 Arizona 57.05 POLYGON ((-109.0425 37.00026, -109.04798 31.33...
2 05 Arkansas 56.43 POLYGON ((-94.47384 36.50186, -90.15254 36.496...
3 06 California 241.70 POLYGON ((-123.23326 42.00619, -122.37885 42.0...
4 08 Colorado 49.33 POLYGON ((-107.91973 41.00391, -105.72895 40.9...
nameus_states_gdf.crs = iso_gdf.crs sets the two CRS’s to matchsjoinsjoin takes two GeoDataFrames and creates a new onesjoin:
GeoDataFrame, right GeoDataFramehow: left, right, innerpredicate: intersects (default), contains, within, overlaps…sjoin: howhow: determines which rows of which gdf survive
left: all rows of left gdf survive, regardless of if they successfully joininner: only rows of left gdf that successfully join surviveright: all rows of right gdf survive, regardless of if they successfully joinNote: if there is a 1:m join, the sjoin will duplicate rows
sjoin: predicatepredicate captures the type of spatial join
intersects is a more generalized version of overlaps that allows for different types of object types
states_with_any_miso = gpd.sjoin(us_states_gdf,
miso_gdf,
how='inner',
predicate='intersects')
print("Object type:", type(states_with_any_miso))
print("\nGeometry type:", states_with_any_miso.geom_type.value_counts())Object type: <class 'geopandas.geodataframe.GeoDataFrame'>
Geometry type: Polygon 21
MultiPolygon 1
Name: count, dtype: int64
id name density geometry \
0 01 Alabama 94.65 POLYGON ((-87.3593 35.00118, -85.60668 34.9847...
2 05 Arkansas 56.43 POLYGON ((-94.47384 36.50186, -90.15254 36.496...
11 17 Illinois 231.50 POLYGON ((-90.63998 42.51006, -88.78878 42.493...
12 18 Indiana 181.70 POLYGON ((-85.99006 41.75972, -84.80704 41.759...
13 19 Iowa 54.81 POLYGON ((-91.36842 43.50139, -91.21506 43.501...
index_right OBJECTID ID \
0 0 1 56669
2 0 1 56669
11 0 1 56669
12 0 1 56669
13 0 1 56669
NAME \
0 Midcontinent Independent Transmission System O...
2 Midcontinent Independent Transmission System O...
11 Midcontinent Independent Transmission System O...
12 Midcontinent Independent Transmission System O...
13 Midcontinent Independent Transmission System O...
ADDRESS CITY ... \
0 720 WEST CITY CENTER DRIVE CARMEL ...
2 720 WEST CITY CENTER DRIVE CARMEL ...
11 720 WEST CITY CENTER DRIVE CARMEL ...
12 720 WEST CITY CENTER DRIVE CARMEL ...
13 720 WEST CITY CENTER DRIVE CARMEL ...
NAICS_DESC \
0 ELECTRIC POWER GENERATION, TRANSMISSION AND DI...
2 ELECTRIC POWER GENERATION, TRANSMISSION AND DI...
11 ELECTRIC POWER GENERATION, TRANSMISSION AND DI...
12 ELECTRIC POWER GENERATION, TRANSMISSION AND DI...
13 ELECTRIC POWER GENERATION, TRANSMISSION AND DI...
SOURCE \
0 FERC 714, EIA 860, EIA 861, TIGER/Line Shapefi...
2 FERC 714, EIA 860, EIA 861, TIGER/Line Shapefi...
11 FERC 714, EIA 860, EIA 861, TIGER/Line Shapefi...
12 FERC 714, EIA 860, EIA 861, TIGER/Line Shapefi...
13 FERC 714, EIA 860, EIA 861, TIGER/Line Shapefi...
SOURCEDATE VAL_METHOD VAL_DATE \
0 Wed, 24 Jun 2020 00:00:00 GMT OTHER Fri, 28 May 2021 00:00:00 GMT
2 Wed, 24 Jun 2020 00:00:00 GMT OTHER Fri, 28 May 2021 00:00:00 GMT
11 Wed, 24 Jun 2020 00:00:00 GMT OTHER Fri, 28 May 2021 00:00:00 GMT
12 Wed, 24 Jun 2020 00:00:00 GMT OTHER Fri, 28 May 2021 00:00:00 GMT
13 Wed, 24 Jun 2020 00:00:00 GMT OTHER Fri, 28 May 2021 00:00:00 GMT
WEBSITE PEAK_LOAD AVG_LOAD YEAR \
0 https://www.misoenergy.org/ 116627 97191 2019
2 https://www.misoenergy.org/ 116627 97191 2019
11 https://www.misoenergy.org/ 116627 97191 2019
12 https://www.misoenergy.org/ 116627 97191 2019
13 https://www.misoenergy.org/ 116627 97191 2019
GlobalID
0 29835694-4676-44ed-84cd-b74a88642762
2 29835694-4676-44ed-84cd-b74a88642762
11 29835694-4676-44ed-84cd-b74a88642762
12 29835694-4676-44ed-84cd-b74a88642762
13 29835694-4676-44ed-84cd-b74a88642762
[5 rows x 25 columns]
States covered by MISO (any amount): ['Alabama', 'Arkansas', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'North Dakota', 'Ohio', 'Oklahoma', 'South Dakota', 'Tennessee', 'Texas', 'Wisconsin', 'Wyoming']
For more complicated/layered spatial plots, use matplotlib
# set up plot
fig, ax = plt.subplots(figsize=(8, 6))
# plot MISO in red
miso_gdf.plot(ax=ax, color='firebrick', alpha=0.5)
# plot all states in gray boundary
us_states_gdf.boundary.plot(ax=ax, color='lightgray', linewidth=0.5)
# plot inner join states in navy boundary
states_with_any_miso.boundary.plot(ax=ax, edgecolor='navy', linewidth=1.2)
ax.set_axis_off()
plt.show()We will cover more on making production-quality maps next lecture
sjoinmatplotlib for more complicated spatial mapspoints_from_xyNote that this is not a spatial file type, but just a CSV
Index(['country', 'country_long', 'name', 'gppd_idnr', 'capacity_mw',
'latitude', 'longitude', 'primary_fuel', 'other_fuel1', 'other_fuel2',
'other_fuel3', 'commissioning_year', 'owner', 'source', 'url',
'geolocation_source', 'wepp_id', 'year_of_capacity_data',
'generation_gwh_2013', 'generation_gwh_2014', 'generation_gwh_2015',
'generation_gwh_2016', 'generation_gwh_2017', 'generation_gwh_2018',
'generation_gwh_2019', 'generation_data_source',
'estimated_generation_gwh_2013', 'estimated_generation_gwh_2014',
'estimated_generation_gwh_2015', 'estimated_generation_gwh_2016',
'estimated_generation_gwh_2017', 'estimated_generation_note_2013',
'estimated_generation_note_2014', 'estimated_generation_note_2015',
'estimated_generation_note_2016', 'estimated_generation_note_2017'],
dtype='object')
us_gppd.csvWe are interested in fuel types. From the README.txt:
So we are interested in studying primary_fuel
us_gppd.csvsummary = (
us_powerplant_df
.groupby("primary_fuel", as_index=False)["generation_gwh_2019"]
.sum()
.rename(columns={"generation_gwh_2019": "total_generation_gwh_2019"})
.sort_values(by="total_generation_gwh_2019", ascending=False)
)
print(summary) primary_fuel total_generation_gwh_2019
3 Gas 1.577508e+06
1 Coal 9.357724e+05
6 Nuclear 8.071456e+05
13 Wind 2.946884e+05
5 Hydro 2.804984e+05
10 Solar 7.142216e+04
12 Waste 4.378694e+04
0 Biomass 2.570438e+04
4 Geothermal 1.554570e+04
7 Oil 1.483142e+04
9 Petcoke 6.967579e+03
2 Cogeneration 3.888141e+03
8 Other 1.155774e+03
11 Storage -6.636100e+01
us_gppd.csv as a GeoDataFramepandas dataframe up until nowlatitude and longitudeREADME.txt:us_gppd.csv as a GeoDataFrameus_powerplant_gdf = gpd.GeoDataFrame(
geometry=gpd.points_from_xy(
us_powerplant_df.longitude,
us_powerplant_df.latitude,
crs="EPSG:4326"),
data=us_powerplant_df)
print("Geometry type:", us_powerplant_gdf.geom_type.value_counts())Geometry type: Point 9581
Name: count, dtype: int64
us_powerplant_gdfus_powerplant_gdf is on the same CRS as us_states_gdfus_powerplant_gdfI've asserted that the CRS's match.
us_powerplant_gdflatitude and longitude of power plants (converted into GeoDataFrame)primary_fuel: categorical attributeprimary_fuel in different colors and plotus_powerplant_gdffig, ax = plt.subplots(figsize=(8, 6))
us_states_gdf.boundary.plot(ax=ax, color='lightgray', linewidth=0.5)
us_powerplant_gdf.plot(
ax=ax,
column='primary_fuel',
categorical=True,
legend=True,
markersize=2,
alpha=0.7
)
ax.set_axis_off()
ax.set_title("Power Plants by Primary Fuel Type", fontsize=14)
plt.tight_layout()
plt.show()Question: what is the headline of this map? Sub-messages?
The map on the previous slide is too cluttered. Let’s restrict just to the top 4 fuel types.
# restrict to top 4 fuels by total gwh
top_fuels = summary["primary_fuel"].head(4)
filtered_powerplant_gdf = us_powerplant_gdf[us_powerplant_gdf["primary_fuel"].isin(top_fuels)]
fig, ax = plt.subplots(figsize=(8, 6))
us_states_gdf.boundary.plot(ax=ax, color='lightgray', linewidth=0.5)
# loop through each fuel type and plot
for fuel_type, group in filtered_powerplant_gdf.groupby("primary_fuel"):
group.plot(ax=ax, markersize=3, label=fuel_type,alpha=0.7)
# define legend
ax.legend(title="Primary Fuel", fontsize=8, title_fontsize=8, loc="lower right", frameon=False)
ax.set_axis_off()
plt.tight_layout()
plt.show()Discussion question: what headlines/submessages come out now?
Recall from visualization lectures: an alternative to color is small multiples:
fig, axes = plt.subplots(2, 2, figsize=(6, 6), constrained_layout=True) # 2x2 subplots
axes = axes.flatten() # flatten for iteration
for ax, fuel_type in zip(axes, summary["primary_fuel"].head(4)):
group = filtered_powerplant_gdf[filtered_powerplant_gdf["primary_fuel"] == fuel_type]
group.plot(ax=ax, markersize=3, alpha=0.7, color='blue')
us_states_gdf.boundary.plot(ax=ax, color='lightgray', linewidth=0.5)
ax.set_title(fuel_type, fontsize=16)
ax.set_axis_off()
plt.show()Discussion questions: what are the headlines and submessages now?
Motivating question: create a small-multiples plot to depict the spatial distribution of power plants by fuel type (top 4) in Illinois.
Hints:
us_states_gdf to just Illinois.sjoin from previous case study id name density geometry
0 01 Alabama 94.65 POLYGON ((-87.3593 35.00118, -85.60668 34.9847...
1 04 Arizona 57.05 POLYGON ((-109.0425 37.00026, -109.04798 31.33...
2 05 Arkansas 56.43 POLYGON ((-94.47384 36.50186, -90.15254 36.496...
3 06 California 241.70 POLYGON ((-123.23326 42.00619, -122.37885 42.0...
4 08 Colorado 49.33 POLYGON ((-107.91973 41.00391, -105.72895 40.9...
Code to plot small-multiples of top 4 fuel types nationally: spatial1_dps.qmd
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import patheffects as pe # used later to format text
import shapely
# load geodata
us_states_gdf = gpd.read_file('data/derived-data/lower_48_states.geojson')
us_powerplant_df = pd.read_csv('data/derived-data/us_gppd.csv')
us_powerplant_gdf = gpd.GeoDataFrame(
geometry=gpd.points_from_xy(
us_powerplant_df.longitude,
us_powerplant_df.latitude,
crs="EPSG:4326"),
data=us_powerplant_df)
# summarize top fuels by gwh in 2019
summary = (
us_powerplant_df
.groupby("primary_fuel", as_index=False)["generation_gwh_2019"]
.sum()
.rename(columns={"generation_gwh_2019": "total_generation_gwh_2019"})
.sort_values(by="total_generation_gwh_2019", ascending=False)
)
# subset to power plants for top 4 fuels
filtered_powerplant_gdf = us_powerplant_gdf[us_powerplant_gdf["primary_fuel"].isin(summary["primary_fuel"].head(4))]
# initiate 2x2 subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10), constrained_layout=True) # 2x2 subplots
axes = axes.flatten() # flatten for iteration
# iterate through top 4 fuel power plants and plot
for ax, fuel_type in zip(axes, summary["primary_fuel"].head(4)):
group = filtered_powerplant_gdf[filtered_powerplant_gdf["primary_fuel"] == fuel_type]
group.plot(ax=ax, markersize=3, alpha=0.7, color='blue')
us_states_gdf.boundary.plot(ax=ax, color='lightgray', linewidth=0.5)
ax.set_title(fuel_type, fontsize=16)
ax.set_axis_off()
plt.show()points_from_xy converts datasets with latitude and longitude into spatial data